Chapter 1 - Introduction to IODS course project

Brief introduction

Hello, I am Juho Mononen, a PhD student from UEF who has some experience with GitHub and is quite familiar with R and RStudio. I am currently dabbling in bioinformatics, with a main focus on identifying regulatory sites affected by genetic variation. I am mostly self-taught in programming, so I am participating to unlearn potentially bad habits and to learn better practices in project management and statistics.

A friend introduced me to this course.

Things I expect to learn

  • Basics of version control and Github usage with R
  • Good practices with R and Rstudio

Chapter 2 - Regression and model validation

General features of the dataset

ASSIST dataset

We are going to analyse a subset of a data set from the ASSIST (Approaches and Study Skills Inventory for Students) study, which in its raw format can be found here. The study evaluated different learning approaches in students. The approaches covered were DEEP learning, STRAtegic learning and SURFace learning (the bolded parts of the approach names correspond to the variable names in the dataset, which is loaded below as students2014).

# Read file that was wrangled
students2014 <- read.delim("Data/learning2014.tsv", header = T, sep = "\t")

# Structure and dimensions
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: int  39 33 33 37 35 38 37 37 34 29 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
# ...and dimensions
dim(students2014)
## [1] 166   7

As can be seen from the str() and dim() calls, the subset we have consists of 7 variables with 166 observations each. Three of these variables are numeric (deep, stra and surf), one is character/categorical (gender) and three are integers (age, attitude and points).

Loading packages for easier data wrangling and plotting

For the following steps, which include plotting, we are going to use some functions from the following packages:

library(GGally)
library(ggpubr)
library(tidyverse)
library(broom)

Getting to know the data

Now that we have our packages set up, we can proceed to take a closer look at the variables. Let's start with summary(), which gives us some basic statistics for the variables:

summary(students2014)
##     gender               age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :13.00   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:30.25   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :34.00   Median :3.667  
##                     Mean   :25.51   Mean   :33.81   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:37.75   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :50.00   Max.   :4.917  
##       stra            surf           points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Although this function provides us with a lot of information, it is not very intuitive to assess larger differences from this output. We can tell that the 166 males and females are between 17 and 55 years old, with a mean age of 25.51. Also, the variables age, attitude and points are on average larger than deep, stra and surf. Let's make some plots to look into the data further. First, let's make some adjustments to students2014 to make it easier to plot with e.g. ggplot().

## Lets gather our multi-column data to three columns to make plotting and data editing with grouping easier
to.ggplot <- students2014 %>% gather(key="Variable", value = "Value", -gender) 
## Check that everything is in order
str(to.ggplot)
## 'data.frame':    996 obs. of  3 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Variable: chr  "age" "age" "age" "age" ...
##  $ Value   : num  53 55 49 53 49 38 50 37 37 42 ...

Now, if we want to perform statistical testing for the variables, we should figure out which of them are normally and which non-normally distributed. We can do this by performing the Shapiro-Wilk test on the data, with the commands described below. The Shapiro-Wilk test and normality testing in R are described here in more detail.

## Lets test for normality so we can determine the best tests for statistical difference
normality.res <- to.ggplot %>%
  ## perform by groups
  group_by(Variable) %>% 
  ## create data.frame()-ception
  nest() %>% 
  ## perform shapiro-wilk to the nested DFs
  mutate(Shapiro.test = map(data, ~ shapiro.test(.x$Value))) %>% 
  ## use broom::glance to generate shapiro-wilk summary
  mutate(Shapiro.sum = Shapiro.test %>% map(glance)) %>%
  ## unlist the summary to make it more accessible
  unnest(Shapiro.sum)

## Save normally distributed and non-normally distributed variables into variables
normal.values <- normality.res$Variable[which(normality.res$p.value>0.05)]
non.normal <- normality.res$Variable[which(normality.res$p.value<=0.05)]

## Print the results
normal.values
## [1] "stra" "surf"
non.normal
## [1] "age"      "attitude" "deep"     "points"

It would seem that the variables stra and surf are normally distributed (Shapiro-Wilk P-value > 0.05), while age, attitude, deep and points are not (Shapiro-Wilk P-value \(\leq\) 0.05).

Generating some basic plots for data exploration

Since we don't really have a grasp of the female/male ratio in our data, we can generate a plot for that information. We can also test whether the distributions of certain variables differ between females and males: we can draw box plots of the distributions and perform either an independent t-test or an unpaired two-sample Wilcoxon test, depending on the variable normality we tested before.

## generate a simple bar plot for counts of male and female subjects in the data set
p1 <- ggplot(dplyr::select(students2014, gender), aes(x=gender, fill=gender)) +
  geom_bar(colour="black") + 
  ## expand the y-axis so that the count labels don't cut off
  scale_y_continuous(expand = expansion(mult = c(0, .2))) + 
  geom_text(stat='count', aes(label=..count..), vjust=-1) +
  scale_fill_manual(values = c("red", "blue")) +
  theme_bw(12) + 
  ## remove unnecessary x axis title
  theme(axis.title.x = element_blank())

## Plot normally distributed variables as box plots and perform t-test using stat_compare_means
p2.n <- ggplot(dplyr::filter(to.ggplot, Variable %in% normal.values), 
               aes(x = gender, y = Value, fill = gender, group=gender)) +
  geom_boxplot() + 
  ## expand the y-axis so that the P-value labels don't cut off
  scale_y_continuous(expand = expansion(mult = c(0, .2))) +
  scale_fill_manual(values = c("red", "blue")) + 
  ## Separate variables into different facets
  facet_wrap(~Variable, scales="free", nrow=1) + 
  ## perform T-test
  stat_compare_means(comparisons = list(c("F","M")), method = "t.test", paired = FALSE) +
  theme_bw(12) + 
  ## remove unnecessary legend (the same legend is shown in p2.nn)
  ## remove unnecessary x axis title
  theme(legend.position = "none",
        axis.title.x = element_blank())

## Plot non-normally distributed variables as box plots and perform Wilcoxon test using stat_compare_means
p2.nn <- ggplot(dplyr::filter(to.ggplot, Variable %in% non.normal), 
                aes(x = gender, y = Value, fill = gender, group=gender)) +
  geom_boxplot() + 
  ## expand the y-axis so that the P-value labels don't cut off
  scale_y_continuous(expand = expansion(mult = c(0, .2))) +
  scale_fill_manual(values = c("red", "blue")) +
  ## Separate variables into different facets
  facet_wrap(~Variable, scales="free", nrow=1) +
  ## perform unpaired Wilcoxon test
  stat_compare_means(comparisons = list(c("F","M")), method = "wilcox.test", paired = FALSE) +
  theme_bw(12) + 
  ## remove unnecessary y-axis title as it is in p2.n
  ## remove unnecessary x axis title
  theme(axis.title.y = element_blank(),
        axis.title.x = element_blank()) 
## Combine plots with ggpubr's ggarrange() for labeling
ggarrange(p1, p2.n, p2.nn, labels = c("A", "B"), nrow=1, widths = c(1,1,2.3))

Figure 1: Basic plots for data exploration. A) Counts of data set subjects by gender. B) Box plots of the different variables.

There are almost twice as many females in the data set as there are males (Figure 1A). Age seems to be the only variable with a different distribution between males and females, with males being older on average (Wilcoxon P-value < 0.05) (Figure 1B).
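The test annotated in Figure 1B can also be run on its own, outside the plotting call. A minimal sketch using base R's wilcox.test(), assuming the students2014 data frame from the chunks above:

```r
## Unpaired two-sample Wilcoxon (Mann-Whitney) test of age by gender,
## the same comparison that stat_compare_means() annotates in Figure 1B
wilcox.test(age ~ gender, data = students2014, paired = FALSE)
```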

Performing regression analysis for points

Now that we are familiar with the data set we can move on to regression modelling. We are going to use multiple regression to test how well exam points can be predicted using 3 variables from the data set (more information on multiple regression here). We can select e.g. the variables with the highest correlation with points for the fit. While we are at it, we can plot all of the inter-variable correlations in our data set using GGally::ggcorr().

## remove gender as it is not numeric and gives error
ggcorr(students2014[colnames(students2014) != "gender"], palette = "RdBu", label = TRUE, label_round=3) 

Figure 2: Correlation plot for the different variables

Based on the results of the correlation analysis (Figure 2), we can use the variables stra, surf and attitude as our explanatory variables for points. Let's build the model using lm() and view the summary of the results.

lm.data <- lm(points ~ attitude + stra + surf, data = students2014)
summary(lm.data)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.7071  -3.3760   0.4279   3.6677  10.0219 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.87300    3.98295   3.232  0.00149 ** 
## attitude     0.33976    0.07784   4.365 2.26e-05 ***
## stra         0.69501    0.56790   1.224  0.22280    
## surf        -1.36841    0.82397  -1.661  0.09870 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 162 degrees of freedom
## Multiple R-squared:  0.1378, Adjusted R-squared:  0.1218 
## F-statistic: 8.627 on 3 and 162 DF,  p-value: 2.399e-05

Out of the variables used for fitting, only attitude seems to be a good addition to the model, with P-value = 2.26e-05. For every "global attitude toward statistics" point, exam points rise by ~0.34. The rest of the variables used in the model did not reach significance (P-value < 0.05), although with a looser cut-off (e.g. P-value < 0.1) surf, i.e. "surface approach", seems to have a negative impact on the test score.

The model, however, is quite unimpressive. Even though it is statistically significant (the independent variables explain the dependent variable, P-value = 2.399e-05), it does not explain much of the variation. The R2 value can be used to assess how good the model is at explaining the variability in the data. R2 is calculated in the following manner:

\[ R^2 = \frac{\text{Explained variation of the model}}{\text{Total variation of the model}} \]

The higher the R2 (the closer to 1), the better. The multiple R2 of the model is a measly 0.1378, which means that the current model fails to predict most of the variability (it explains ~14% of it). In summary, the model explains the dependent variable, but only to a small degree.
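The same quantity can be computed by hand from the fitted model. A short sketch, assuming the lm.data object fitted above, where R2 equals one minus the residual sum of squares over the total sum of squares:

```r
## R^2 by hand: 1 - RSS/TSS, i.e. the share of variation the model explains
rss <- sum(residuals(lm.data)^2)
tss <- sum((students2014$points - mean(students2014$points))^2)
1 - rss / tss   ## should match summary(lm.data)$r.squared
```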

Assessing the quality of the model

We can now assess the model's validity by plotting some diagnostic plots.

par(mfrow = c(2,2))
plot(lm.data, which=c(1,2,5))

Figure 3: Plots for Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage

From the "Residuals vs Fitted" plot in Figure 3 we can see that there are no worrying trends, which suggests that we can assume a linear relationship between the dependent and independent variables. In the "Normal Q-Q" plot of Figure 3 the points follow the reference line quite nicely, so we can assume normality of the residuals. The "Residuals vs Leverage" plot in Figure 3 shows no influential points (observations that drastically affect the coefficients in the model summary). If there were influential points, we should pinpoint the cause behind the outlier and possibly even remove it (if it stems from an erroneous measurement etc.).

# Date and session info
date()
## [1] "Sun Dec 05 17:38:04 2021"
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Finnish_Finland.1252  LC_CTYPE=Finnish_Finland.1252   
## [3] LC_MONETARY=Finnish_Finland.1252 LC_NUMERIC=C                    
## [5] LC_TIME=Finnish_Finland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] broom_0.7.10    forcats_0.5.1   stringr_1.4.0   dplyr_1.0.7    
##  [5] purrr_0.3.4     readr_2.0.2     tidyr_1.1.4     tibble_3.1.5   
##  [9] tidyverse_1.3.1 ggpubr_0.4.0    GGally_2.1.2    ggplot2_3.3.5  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.7         lubridate_1.8.0    assertthat_0.2.1   digest_0.6.28     
##  [5] utf8_1.2.2         R6_2.5.1           cellranger_1.1.0   plyr_1.8.6        
##  [9] backports_1.3.0    reprex_2.0.1       evaluate_0.14      httr_1.4.2        
## [13] highr_0.9          pillar_1.6.4       rlang_0.4.12       curl_4.3.2        
## [17] readxl_1.3.1       rstudioapi_0.13    data.table_1.14.2  car_3.0-11        
## [21] jquerylib_0.1.4    rmarkdown_2.11     labeling_0.4.2     foreign_0.8-81    
## [25] munsell_0.5.0      compiler_4.1.1     modelr_0.1.8       xfun_0.27         
## [29] pkgconfig_2.0.3    htmltools_0.5.2    tidyselect_1.1.1   rio_0.5.27        
## [33] reshape_0.8.8      fansi_0.5.0        tzdb_0.1.2         crayon_1.4.2      
## [37] dbplyr_2.1.1       withr_2.4.2        grid_4.1.1         jsonlite_1.7.2    
## [41] gtable_0.3.0       lifecycle_1.0.1    DBI_1.1.1          magrittr_2.0.1    
## [45] scales_1.1.1       zip_2.2.0          cli_3.1.0          stringi_1.7.5     
## [49] carData_3.0-4      farver_2.1.0       ggsignif_0.6.3     fs_1.5.0          
## [53] xml2_1.3.2         bslib_0.3.1        ellipsis_0.3.2     generics_0.1.1    
## [57] vctrs_0.3.8        cowplot_1.1.1      openxlsx_4.2.4     RColorBrewer_1.1-2
## [61] tools_4.1.1        glue_1.4.2         hms_1.1.1          abind_1.4-5       
## [65] fastmap_1.1.0      yaml_2.2.1         colorspace_2.0-2   rvest_1.0.2       
## [69] rstatix_0.7.0      knitr_1.36         haven_2.4.3        sass_0.4.0

Chapter 3 - Logistic regression

Loading packages for easier data wrangling and plotting

For the following steps, which include plotting, we are going to use some functions from the following packages:

library(GGally) ## plotting
library(ggpubr) ## package which contains ggplot2 + useful tools 
library(tidyverse) ## wrangling
library(gridExtra) ## for plotting tables 

General features of the dataset

Dataset & variables

In this session we will be exploring the Student Performance Dataset, which is based on data collected from students of two Portuguese schools. The data was collected from school reports and questionnaires. The full data can be found here. Let's have a closer look inside the data:

# Read the file that was wrangled
alc <- read.delim("Data/alc.tsv", header = T, sep = "\t")

# Print colnames -> variables
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures.p"
## [16] "schoolsup"  "famsup"     "paid.p"     "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences.p"
## [31] "G1.p"       "G2.p"       "G3.p"       "failures.m" "paid.m"    
## [36] "absences.m" "G1.m"       "G2.m"       "G3.m"       "failures"  
## [41] "paid"       "absences"   "G1"         "G2"         "G3"        
## [46] "alc_use"    "high_use"

The data holds several levels of information on the students, for example school, age, municipality type (rural/urban) and information about the parents' education and job status. It also has information on extracurricular activities on several levels (e.g. free time after school and going out with friends) on a 1-5 scale. However, in this session we are particularly interested in the variables alc_use and its categorical version high_use, which describe the weekly average alcohol usage and the categorized normal/high alcohol usage, respectively.
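If the wrangling followed the usual convention for this dataset, these two columns were derived from the weekday (Dalc) and weekend (Walc) consumption ratings. A sketch of that assumed definition (the actual wrangling script is the authoritative source):

```r
## Assumed derivation of the alcohol-use variables (check the wrangling script):
## alc_use  = mean of weekday (Dalc) and weekend (Walc) consumption ratings
## high_use = TRUE when that average exceeds 2
alc.check <- alc %>%
  mutate(alc_use_re  = (Dalc + Walc) / 2,
         high_use_re = alc_use_re > 2)
```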

Graphical exploration of the variables

Even though the dataset contains many interesting variables that could relate to high alcohol usage, let's select a few here and see how, by eye, they are distributed among the students.

## selected variables to be plotted
variables <- c("activities", "goout", "studytime", "absences")

## Generate a list of plots
p.list <- lapply(variables, function(x) {
  
  ## Create a data.frame for plotting with the variable name as one column
  tmp <- alc[x] %>% dplyr::rename(Value=x) %>% mutate({{x}}:=x)

  ## if value is numeric we want to plot distributions as violin plots
  if (is.numeric(tmp$Value)) {
     ggplot(tmp, aes_string(y="Value", x=x)) +
      geom_violin(fill="dodgerblue4") +
      scale_y_continuous(expand = expansion(mult = c(0, .2))) + # remove spaces of axis/plot
      theme_classic(14) +
      theme(axis.text.x =element_blank())
  ## Else we want to plot bar plots for categorical variable occurrences
  } else {
    ggplot(tmp, aes_string(x="Value")) +
      geom_bar(position = "dodge", fill="dodgerblue4") +
      geom_text(stat='count', aes(label=..count..), vjust=-1) +
      scale_y_continuous(expand = expansion(mult = c(0, .2))) + 
      theme_classic(14) +
      xlab(x)
    }
})
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(x)` instead of `x` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## plot the list of plots
ggarrange(plotlist = p.list, nrow = 1, align = "hv")

Figure 1. Interesting variables for alcohol consumption exploration

activities, which stands for "extra-curricular activities", seems to be fairly equally distributed between the no/yes categories among the students. goout (going out with friends, from 1 - very low to 5 - very high) seems to be quite evenly distributed around 3, meaning that most students feel they partake in social activities in a "normal manner". studytime (weekly study time: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours), on the other hand, suggests that these students are not that hard-working in general. High numbers of absences (number of school absences) are rare.

Initial inspection of the relation of the variables and high-use

Let's have a look at how, by eye, these variables relate to alcohol usage and high usage. We'll also have a quick glance at the average alcohol usage and the distribution of high-users among the students.

## plot bar plot of high vs non high use
p1 <- ggplot(alc, aes(x=high_use, fill=high_use)) +
  geom_bar(position="dodge") +
  scale_fill_brewer(palette = "Set1") +
  geom_text(stat='count', aes(label=..count..), vjust=-1) +
  scale_y_continuous(expand = expansion(mult = c(0, .2))) + 
  theme_bw(14)

## plot distribution of alcohol use and color high use
p2 <- ggplot(alc, aes(x=alc_use, fill=high_use)) +
  geom_histogram(binwidth = 0.5,boundary=0) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(expand = expansion(mult = c(0, 0))) + 
  scale_x_continuous(expand = expansion(mult = c(0, 0))) + 
  theme_bw(14)

# for the activities variable, plot proportions across alcohol usage
p3 <- ggplot(alc, aes(x=alc_use, fill=activities)) +
  geom_histogram(binwidth = 0.5, position="fill",boundary=0) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(expand = expansion(mult = c(0, 0))) + 
  scale_x_continuous(expand = expansion(mult = c(0, 0))) + 
  theme_bw(14)

## select numeric variables for the correlation plot. Let's use Spearman for simplicity.
vars.to.cor <- c("goout", "studytime", "absences", "alc_use") 
p4 <- alc %>% 
  select(any_of(vars.to.cor)) %>% 
  ggcorr(palette = "RdBu", label = TRUE, label_round=3, method=c("everything", "spearman"))

## combine plots
ggarrange(p1,p2,p3,p4, nrow = 2, ncol = 2, labels = LETTERS[1:4])

Figure 2. A) Distribution of high versus normal use students. B) Distribution of alcohol usage among students. Coloring based on high versus normal use. C) Proportions of activity-partaking students at different alcohol usage levels. D) Correlation values for all numeric variables of interest and alcohol usage.

There are fewer high-amount alcohol users (high-users) among the students than there are normal users (Figure 2A). Most of the students also have really low alcohol usage compared to the few high-users (Figure 2B). The one categorical variable that was selected (activities) can be explored as proportions by alcohol usage: the ratio of extracurricular activity participation differs slightly between high and low alcohol usage, but there are no major trends visible (Figure 2C). Correlation analysis of the numeric variables hints that a higher number of school absences and going out with friends are connected with high use, while time spent studying has a negative correlation (Figure 2D). However, it should be noted that all the correlations are fairly low.

Based on this exploration we could hypothesize the following: 1. Activities are not predictive of high use. 2. More absences and going out predict a higher likelihood of high use. 3. More time spent studying predicts lower use.

Performing logistic regression on the chosen variables

Let's now perform logistic regression to see if our predictions/hypotheses were correct.

## Fit a logistic regression model with the chosen variables
m <- glm(high_use ~ goout + studytime + absences + activities, data = alc, family = "binomial")

## compute odds ratios (OR)
OR <- coef(m) %>% exp

## compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
## save the odds ratios with their confidence intervals
res <- as.data.frame(cbind(OR, CI)) %>% 
  rownames_to_column("Variable") %>% 
  dplyr::filter(Variable!="(Intercept)")

## Generate OR plot
p.or <- ggplot(res, aes(x = OR, y = Variable)) + 
  geom_point(colour="red", size=3) + 
  geom_segment(aes(x = `2.5 %`, xend= `97.5 %`, yend=Variable), colour="red", lwd=1) +
  geom_vline(xintercept = 1, linetype="dashed") +
  ylab(NULL) +
  theme_bw(14)

## generate table plot for model summary
p.tab <- summary(m)$coefficients %>% as.data.frame() %>%
  mutate_all(function(x) signif(x, digits = 3)) %>% 
  tableGrob()

## Combine plots
ggarrange(p.or, p.tab, nrow=1, labels = c("A","B"))

Figure 3. A) Odds ratios for the variables. Points represent the OR and segments the confidence intervals. B) Summary of the GLM variables.

It seems that we were somewhat correct in our assumptions. Going out with friends boasts a mighty odds ratio (OR) of 2.09 (Figure 3A) and is significantly associated with high use based on the model summary (Figure 3B). Study time has a significant negative (OR = 0.57) association with high use. For the other variables things aren't so straightforward. Absences, while significantly associated with high use, presents only a minimal OR of 1.07. For our categorical variable, activities, the model shows no significant difference between students partaking in activities and those who are not with regard to high use (Figure 3A-B). These results are quite similar to our earlier hypotheses, apart from the really low impact of absences.

Based on these results we don't need to eliminate any of the numerical variables from our model. However, as explained here, we should perform a likelihood ratio test for activities to decide whether to keep it in the model or not.

test.mod1 <- glm(high_use ~ goout + studytime + absences + activities, data = alc, family = "binomial") # with activities
test.mod2 <- glm(high_use ~ goout + studytime + absences, data = alc, family = "binomial") # without activities

anova(test.mod1, test.mod2, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ goout + studytime + absences + activities
## Model 2: high_use ~ goout + studytime + absences
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       365     375.68                     
## 2       366     377.16 -1  -1.4798   0.2238

The test is non-significant (P = 0.2238), so we can remove activities from our model.

Exploring the power of the model

Refitting and tabulations

Let's now re-fit the model and perform some tabulations and predictions to see how strong it is. Let's start by tabulating the outcomes based on our model.

## Generate model with significant variables
m2 <- glm(high_use ~ goout + studytime + absences, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m2, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability>0.5)

# tabulate the target variable versus the predictions. Unclass here enables us to keep 2x2 format with tableGrob
p.tab1 <- table(high_use = alc$high_use, prediction = alc$prediction) %>% unclass() %>% 
  tableGrob()

## plot prop table
p.tab2 <- table(high_use = alc$high_use, prediction = alc$prediction) %>% 
  prop.table() %>% 
  round(digits = 2) %>%
  addmargins() %>% 
  tableGrob()

## Combine plots
ggarrange(p.tab1, p.tab2, nrow=1, labels = c("A", "B"))

Figure 4. A) 2x2 tabulation. B) Proportion table

Our model seems to have reasonably good predictability overall, although the false-positive and false-negative rates are somewhat high (0.32 and 0.21, respectively) (Figure 4A), which can also be seen from the proportion table (Figure 4B).
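For reference, such rates can be pulled directly from the tabulation. A sketch reusing the table and the prediction column from the chunk above (the exact values depend on how the rates are defined):

```r
## Error rates straight from the 2x2 table in Figure 4A
tab <- table(high_use = alc$high_use, prediction = alc$prediction)
fn.rate <- tab["TRUE", "FALSE"] / sum(tab["TRUE", ])   ## high-users missed
fp.rate <- tab["FALSE", "TRUE"] / sum(tab["FALSE", ])  ## normal users flagged
```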

Testing by guessing

Let's test our model's predictiveness compared to simple guessing, using a function from the DataCamp exercise (loss_func) and some fixed probabilities. First let's get the prediction error when we guess that everyone is a high-user, then that no one is a high-user, and compare these results to our model's prediction error.

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

## Let's make a guess with the probability of high use as 1
loss_func(class = alc$high_use,
          prob = 1)
## [1] 0.7
## Let's make a guess with the probability of high use as 0
loss_func(class = alc$high_use,
          prob = 0)
## [1] 0.3
## Let's make a guess with the probability of high use as predicted by our model
loss_func(class = alc$high_use,
          prob = alc$prediction)
## [1] 0.2324324

As we can see, our model has the lowest prediction error compared to the fixed guesses.

10-fold cross validation with the latter model

Additionally, we can perform cross-validation on the model to test its robustness.

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2351351

As we can see, with cross-validation our model has a smaller prediction error than the model introduced in the DataCamp exercise (0.24 < 0.26).

date()
## [1] "Sun Dec 05 17:38:06 2021"
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Finnish_Finland.1252  LC_CTYPE=Finnish_Finland.1252   
## [3] LC_MONETARY=Finnish_Finland.1252 LC_NUMERIC=C                    
## [5] LC_TIME=Finnish_Finland.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] boot_1.3-28     gridExtra_2.3   broom_0.7.10    forcats_0.5.1  
##  [5] stringr_1.4.0   dplyr_1.0.7     purrr_0.3.4     readr_2.0.2    
##  [9] tidyr_1.1.4     tibble_3.1.5    tidyverse_1.3.1 ggpubr_0.4.0   
## [13] GGally_2.1.2    ggplot2_3.3.5  
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.2         sass_0.4.0         jsonlite_1.7.2     carData_3.0-4     
##  [5] modelr_0.1.8       bslib_0.3.1        assertthat_0.2.1   highr_0.9         
##  [9] cellranger_1.1.0   yaml_2.2.1         pillar_1.6.4       backports_1.3.0   
## [13] glue_1.4.2         digest_0.6.28      RColorBrewer_1.1-2 ggsignif_0.6.3    
## [17] rvest_1.0.2        colorspace_2.0-2   cowplot_1.1.1      htmltools_0.5.2   
## [21] plyr_1.8.6         pkgconfig_2.0.3    haven_2.4.3        scales_1.1.1      
## [25] openxlsx_4.2.4     rio_0.5.27         tzdb_0.1.2         generics_0.1.1    
## [29] farver_2.1.0       car_3.0-11         ellipsis_0.3.2     withr_2.4.2       
## [33] cli_3.1.0          magrittr_2.0.1     crayon_1.4.2       readxl_1.3.1      
## [37] evaluate_0.14      fs_1.5.0           fansi_0.5.0        MASS_7.3-54       
## [41] rstatix_0.7.0      xml2_1.3.2         foreign_0.8-81     tools_4.1.1       
## [45] data.table_1.14.2  hms_1.1.1          lifecycle_1.0.1    munsell_0.5.0     
## [49] reprex_2.0.1       zip_2.2.0          compiler_4.1.1     jquerylib_0.1.4   
## [53] rlang_0.4.12       grid_4.1.1         rstudioapi_0.13    labeling_0.4.2    
## [57] rmarkdown_2.11     gtable_0.3.0       abind_1.4-5        DBI_1.1.1         
## [61] reshape_0.8.8      curl_4.3.2         R6_2.5.1           lubridate_1.8.0   
## [65] knitr_1.36         fastmap_1.1.0      utf8_1.2.2         stringi_1.7.5     
## [69] Rcpp_1.0.7         vctrs_0.3.8        dbplyr_2.1.1       tidyselect_1.1.1  
## [73] xfun_0.27

Chapter 4 - Clustering and classification

Loading packages for easier data wrangling and plotting

For the following steps, which include plotting, we are going to use some functions from the following packages:

library(GGally) ## plotting
library(ggpubr) ## package which contains ggplot2 + useful tools 
library(tidyverse) ## wrangling
library(MASS) ## for the data and analysis

General features of the dataset

Dataset & variables

In this session we will be exploring the Boston dataset, which is part of the MASS R package. The dataset contains information on housing values in the suburbs of Boston and was collected for a publication on the methodological problems that come with using housing market data to measure the willingness to pay for clean air*.

*Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102.

## load the Boston dataset to get a clean version
data("Boston")

## check the contents
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14

The dataset has 506 observations of 14 variables, 12 of which are numeric and 2 integer. Let's also have a look at the summary of the variables, as this will be important to note later.

## summarize the variables
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

As we can see, all of the values are positive, with varying means and scales. For example, nox has values ranging from 0.385-0.871, while for tax the values are in the range of 187-711.

Graphical overview of the data

Let's have a look at the data to see whether any trends are visible between the variables.

## Shamelessly borrowing code from here: https://stackoverflow.com/questions/57450913/removing-background-color-for-column-labels-while-keeping-plot-background-color

## this function creates correlation plots (colour scale for correlation and numeric representation) that can be used with ggpairs
cor_func <- function(data, mapping, method, symbol, ...){
  
  # get mapping information from aes
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  
  # calculate correlation
  corr <- cor(x, y, method=method, use='complete.obs')
  # set colour scale that is used for all correlations
  colFn <- colorRampPalette(c("brown1", "white", "dodgerblue"), 
                            interpolate ='spline')
  
  # get the index of the colour for the correlation observed with the variable
  fill <- colFn(100)[findInterval(corr, seq(-1, 1, length = 100))]
  
  # generate text plot with rounded correlation
  ggally_text(
    label = paste(symbol, as.character(round(corr, 2))), 
    mapping = aes(),
    xP = 0.5, yP = 0.5,
    color = 'black',
    ...
  ) +
    # plot the background panel using the correlation colour in fill
    theme(panel.background = element_rect(fill = fill))
}

## Plot pairs plots with ggpairs
ggpairs(Boston, 
        upper = list(continuous = wrap(cor_func, method = 'spearman', symbol = expression('\u03C1 ='))),
        diag = list(continuous = function(data, mapping, ...) {ggally_densityDiag(data = data, mapping = mapping) + 
                  theme(panel.background = element_blank())})) + 
  theme(strip.background = element_rect(fill = "white"),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank())

nox and crim seem to have the highest positive correlation among the variable pairs, while nox and dis seem to have the strongest negative correlation. The value distributions of the variable pairs vary greatly.

Setting up the data set for data analysis

Standardizing variables in the dataset

If we want to perform clustering and related analyses, we should standardize the data. We can use stats::scale() for this. With default settings it standardizes each variable (creates a Z-score) by subtracting the mean and dividing by the standard deviation.
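As a quick sanity check (a minimal sketch on made-up numbers, not part of the exercise data), the Z-score produced by scale() can be reproduced by hand:

```r
# toy vector, not part of the Boston data
x <- c(2, 4, 6, 8, 10)

# scale() with defaults: subtract the mean, divide by the standard deviation
z1 <- as.vector(scale(x))
z2 <- (x - mean(x)) / sd(x)

all.equal(z1, z2) # the two should be identical
```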

# center and standardize variables and transform the matrix into a dataframe
boston_scaled <- scale(Boston) %>% as.data.frame()

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

As we can see, the mean of each variable after standardization is now zero, compared to the original summary which had varying means. Due to centering we also have negative values, which the original data set did not.

Creating test and training datasets

Next we will generate a categorical variable from the crime rate using its quantiles (“low”, “med_low”, “med_high”, “high”) and replace the crim variable with it.

# create quantile breaks for crim
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels=c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim) %>% 
# add the new categorical value to scaled data
  dplyr::mutate(crime=crime)

Now we set our train and test datasets by subsetting observations from the scaled Boston dataset. We will use 80% of the observations for training and 20% for testing.

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# set seed for reproducibility
set.seed(123)
# randomly choose 80% of the rows
ind <- sample(n, size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

#save the correct classes from test data
correct_classes<- test$crime

#remove the crime variable from test data
test<- dplyr::select(test, -crime)

Performing linear discriminant analysis on the data

Linear discriminant analysis

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2500000 0.2500000 0.2450495 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       1.01866313 -0.9066422 -0.08120770 -0.8835724  0.38666334 -0.9213895
## med_low  -0.07141687 -0.3429155  0.03952046 -0.5768545 -0.09882533 -0.3254849
## med_high -0.40148591  0.2162741  0.19544522  0.3637157  0.12480815  0.4564260
## high     -0.48724019  1.0149946 -0.03371693  1.0481437 -0.47733231  0.8274496
##                 dis        rad        tax    ptratio      black        lstat
## low       0.9094324 -0.6819317 -0.7458486 -0.4234280  0.3729083 -0.766883775
## med_low   0.3694282 -0.5395408 -0.4205644 -0.1079710  0.3103406 -0.164921798
## med_high -0.3720478 -0.4349280 -0.3191090 -0.2716959  0.1049654  0.009720124
## high     -0.8601246  1.6596029  1.5294129  0.8057784 -0.6383964  0.900379309
##                 medv
## low       0.47009410
## med_low   0.01548761
## med_high  0.19440747
## high     -0.71294711
## 
## Coefficients of linear discriminants:
##                  LD1         LD2        LD3
## zn       0.148390645  0.74870532 -1.0874785
## indus    0.040971465 -0.38126652 -0.1619456
## chas     0.002460776  0.03963849  0.1699312
## nox      0.312458150 -0.67564471 -1.3104018
## rm       0.011086947 -0.16058718 -0.1572603
## age      0.283482486 -0.38817624 -0.1013491
## dis     -0.079848789 -0.38493775  0.2108038
## rad      3.718978412  0.93123532 -0.4706522
## tax     -0.015197127  0.04230505  1.2889075
## ptratio  0.180294774 -0.01592588 -0.3558490
## black   -0.136724112  0.02948075  0.1288959
## lstat    0.145739238 -0.37823065  0.3345688
## medv     0.061327205 -0.53906503 -0.1509890
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9523 0.0364 0.0113

As we can see from the summary of the model, the proportions of the crime rate categories are nearly equal in our training dataset. The separation achieved by the discriminant functions however differs greatly: LD1 captures 95.23% of the differences between the groups, while LD2 and LD3 add only 3.64% and 1.13% respectively.
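As a side note, the "Proportion of trace" line can be reproduced from the singular values stored in the fitted lda object. A standalone sketch using the built-in iris data (a stand-in for our train set, so the block runs on its own):

```r
library(MASS)

# fit an LDA on the built-in iris data (stand-in for our train set)
fit <- lda(Species ~ ., data = iris)

# proportion of trace = squared singular values, normalized to sum to 1
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)
```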

We can now draw an LDA (bi)plot to see the differences among the groups.

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit, myscale = 1.5)

As we can see from the arrows in the (bi)plot, the variable rad shows the most discrimination (longest arrow), while the other variables have smaller influences. It also seems to be one of the major factors predictive of a high crime rate.

Assessing predictiveness of the model

Let's use our test dataset now to check how accurately our model predicts the crime rate.

# generate predictions using the test dataset
lda.pred <- predict(lda.fit, newdata=test)

#  cross tabulate the results with the crime categories
table(correct= correct_classes, predicted= lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10      10        4    0
##   med_low    2      17        6    0
##   med_high   1       9       13    2
##   high       0       0        1   27
# get by column proportions of classifications to compare predictiveness of the model across the groups 
table(correct= correct_classes, predicted= lda.pred$class) %>% prop.table(margin = 2)
##           predicted
## correct           low    med_low   med_high       high
##   low      0.76923077 0.27777778 0.16666667 0.00000000
##   med_low  0.15384615 0.47222222 0.25000000 0.00000000
##   med_high 0.07692308 0.25000000 0.54166667 0.06896552
##   high     0.00000000 0.00000000 0.04166667 0.93103448

As we can see from the tabulation, the model predicts the low and high crime rate categories quite accurately (77% and 93% correct predictions respectively) but is not as good at predicting the medium low and medium high categories (47% and 54% correct predictions respectively).
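For the record, the overall accuracy can be read off a confusion matrix as the proportion of observations on its diagonal. A minimal sketch with a made-up 2x2 table (hypothetical numbers, not our actual results):

```r
# hypothetical confusion matrix: rows = correct class, columns = predicted class
cm <- matrix(c(10,  2,
                3, 15),
             nrow = 2, byrow = TRUE,
             dimnames = list(correct   = c("low", "high"),
                             predicted = c("low", "high")))

# overall accuracy: correct predictions (diagonal) over all predictions
accuracy <- sum(diag(cm)) / sum(cm)
accuracy # 25/30, about 0.83
```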

Distance analysis of the variables

Performing Distance analysis

Let's perform distance analysis on the observations in the dataset.

As something totally extra to the exercise, we will also plot a heatmap with the previously used crime rate quantiles as annotation, so we can see whether the distances are smaller between observations belonging to the same crime rate category!

# reload boston
data("Boston")

# Z-score variables
boston_scaled <- scale(Boston)

# compute the distance matrix between observations
dist.b <- dist(boston_scaled)

# create quantile breaks for crim
bins <- quantile(boston_scaled[,"crim"])

# create a character vector of colours based on the crime rate  
crime <- cut(boston_scaled[,"crim"], breaks = bins, include.lowest = TRUE, labels=c("darkblue", "lightblue", "rosybrown", "brown")) %>% as.character()

## lets also plot the distances as a heatmap for fun 
library(gplots, quietly = T)
## Warning: package 'gplots' was built under R version 4.1.2
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
heatmap.2(as.matrix(dist.b), trace="none", 
          labRow = FALSE, labCol = FALSE,
          ColSideColors=crime,
          RowSideColors=crime)

As we can see from the heatmap, the high crime rate observations cluster together quite nicely (darker red colour -> R’s “brown”) and are quite distinct from the bulk of the low and medium low groups, which are also quite similar to each other (blue colours).

Performing and optimizing K-means clustering

Based on the distance heatmap above, we could start the K-means clustering with 3-4 centers. We’ll use 3 for starters.

# Calculate clusters using K means
km <- kmeans(boston_scaled, centers = 3)

# Print the centers
km$centers
##         crim         zn      indus       chas        nox         rm        age
## 1  0.7847946 -0.4872402  1.1450405 -0.2723291  1.0384727 -0.4896488  0.8062002
## 2 -0.3882449  0.2731699 -0.6264383 -0.2723291 -0.5823006  0.2188304 -0.4585819
## 3 -0.2048299 -0.1564737  0.2306535  3.6647712  0.3342374  0.3344149  0.3170678
##          dis         rad        tax    ptratio      black      lstat       medv
## 1 -0.8383961  1.12436056  1.2034416  0.6329916 -0.6397959  0.9277624 -0.7457491
## 2  0.4807157 -0.58641200 -0.6161585 -0.2814183  0.3151747 -0.4640135  0.3182241
## 3 -0.3634565 -0.02700292 -0.1304164 -0.4453253  0.1787986 -0.1976385  0.6422884

Our clusters have somewhat distinct means, but we could optimize the clustering by using, deviating from the DataCamp exercise, the gap statistic as a goodness-of-clustering measure, which is more closely explained here. The gap statistic is more thoroughly explained here. Simply put, the gap statistic can be used to choose the number of clusters k at which the biggest jump in within-cluster distance has occurred, judged against the overall behavior of uniformly drawn samples*.

One of the main reasons I chose to use the gap statistic instead of the within-cluster sum of squares (WCSS) that was used in the DataCamp exercise is that we can decide k (the number of clusters) without relying on the interpretation of a graph, but rather on the outcome of an equation. In this case we will use the 1-SE rule, which looks for the smallest k such that its value f(k) is not more than 1 standard error away from the first local maximum.
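The 1-SE rule described above is implemented in cluster::maxSE(). A toy sketch with made-up gap values and standard errors (hypothetical numbers, not our Boston results):

```r
library(cluster)

# made-up gap statistic values for k = 1..5 and their standard errors
gap <- c(1.0, 5.0, 5.9, 6.0, 6.05)
se  <- rep(0.2, 5)

# smallest k whose gap value is within 1 SE of the first local maximum
k <- maxSE(gap, se, method = "firstSEmax")
k
```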

Let's perform the analysis and plot the gap statistic.

*Towards Data Science blog post on K-mean and gap-statistic, 28.11.2021

library(cluster)
# perform gap-statistic analysis on the dataset, running 1000 bootstraps
# set seed for bootstrapping reproducibility
set.seed(123)
t.km <- clusGap(boston_scaled, kmeans, 20, B = 1000, iter.max=50) 

# output the number of clusters by 1-SE rule 
k.n <- maxSE(t.km$Tab[,3], t.km$Tab[,4], method = "firstSEmax")

# create a dataframe for plotting
gs.df = data.frame(t.km$Tab, k=1:nrow(t.km$Tab))

# plot with ggplot
ggplot(gs.df, aes(k, gap)) + 
  geom_line() + 
  geom_vline(xintercept = k.n) +
  geom_point() + 
  geom_errorbar(aes(ymax=gap+SE.sim, ymin=gap-SE.sim)) + 
  theme_bw(12)

The optimal number of clusters is determined to be 10 by the analysis.

Visualizing the results

We will now plot a pairs plot to see how these clusters are represented among variable pairs.

# Calculate clusters using K means with optimal n
km.res <- kmeans(boston_scaled, centers = k.n)

# create vector of clusters
clusters <- km.res$cluster

# generate a variable for cluster
to.plot <- boston_scaled %>% as.data.frame() %>% mutate(cluster=paste("cl", clusters, sep = "_"))

# take other variables than cluster for plotting 
v.to.plot <- colnames(to.plot)[colnames(to.plot)!="cluster"]

# plot pairs plot
ggpairs(to.plot,
        # set variables to plot
        columns = v.to.plot, 
        # set colour mapping to clusters; annoyingly, alpha also shows up in the
        # legend and I did not quickly figure out how to remove it without errors
        aes(colour=cluster, alpha=0.8), 
        # plot legend in first slot
        legend = 1,
        # change the upper plot to points as well
        upper=list(continuous = "points", combo = "facethist", discrete = "facetbar", na ="na")) + 
  # move legend to bottom
  theme(legend.position = "bottom") 

The plot is quite colorful with 10 clusters indeed, but there are still noticeable patterns. When we look at the density plots on the diagonal (though they are quite hard to read…) we can see that the clusters separate the observations in the value distributions quite nicely. For example, with zn and indus the separation of the “cl_8” and “cl_9” clusters differs. In the paired scatter plots we can see, for example, for black the separation of “cl_1” and “cl_3” from the others, and for zn the separation of “cl_6”. This visualization might not be the best, but it will suffice this time…

date()
## [1] "Sun Dec 05 17:40:03 2021"

Chapter 5 - Dimensionality reduction techniques

Loading packages for easier data wrangling and plotting

For the following steps, which include plotting, we are going to use functions from the following packages:

library(GGally) ## plotting
library(ggpubr) ## package which contains ggplot2 + useful tools
library(ggfortify) ## for easier PCA plotting
library(tidyverse) ## wrangling
library(FactoMineR) ## analysis
library(factoextra) ## visualization tools for clustering etc.

General features of the dataset

Overview of the variables

The dataset for this week's exercise originates from the United Nations Development Programme.

# load data
human <- read.delim("Data/human.tsv", header = T, sep = "\t")
# show structure 
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ eduF2M   : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ labF2M   : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ edExp    : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ lifeExp  : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ GNIpc    : num  65 42.3 56.4 44 45.4 ...
##  $ matMort  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ birthRate: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parlRep  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
# show summary
summary(human)
##      eduF2M           labF2M           edExp          lifeExp     
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##      GNIpc            matMort         birthRate         parlRep     
##  Min.   :  1.123   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  5.275   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 13.054   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 46.496   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 27.256   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :908.000   Max.   :1100.0   Max.   :204.80   Max.   :57.50

The dataset has 155 observations and 8 variables. All of the variables are numeric, one of them more specifically integer. The scales of the values vary, which is due to some of them being ratios of ratios (e.g. eduF2M), some percentages (parlRep), etc.

Here are the explanations for these variables:

  • GNIpc = Gross National Income per capita
  • lifeExp = Life expectancy at birth
  • edExp = Expected years of schooling
  • matMort = Maternal mortality ratio
  • birthRate = Adolescent birth rate
  • parlRep = Percentage of female representatives in parliament
  • edu2F = Proportion of females with at least secondary education
  • edu2M = Proportion of males with at least secondary education
  • labF = Proportion of females in the labour force
  • labM = Proportion of males in the labour force
  • eduF2M = edu2F/edu2M
  • labF2M = labF/labM

Relationships between variables

Now let's plot the already classic pairs plot to delve into the relationships between the variables.

# this function creates correlation plots (colour scale for correlation and numeric representation) that can be used with ggpairs
cor_func <- function(data, mapping, method, symbol, ...){
  
  # get mapping information from aes
  x <- eval_data_col(data, mapping$x)
  y <- eval_data_col(data, mapping$y)
  
  # calculate correlation
  corr <- cor(x, y, method=method, use='complete.obs')
  # set colour scale that is used for all correlations
  colFn <- colorRampPalette(c("brown1", "white", "dodgerblue"), 
                            interpolate ='spline')
  
  # get the index of the colour for the correlation observed with the variable
  fill <- colFn(100)[findInterval(corr, seq(-1, 1, length = 100))]
  
  # generate text plot with rounded correlation
  ggally_text(
    label = paste(symbol, as.character(round(corr, 2))), 
    mapping = aes(),
    xP = 0.5, yP = 0.5,
    color = 'black',
    ...
  ) +
    # plot the background panel using the correlation colour in fill
    theme(panel.background = element_rect(fill = fill)) + 
  theme(strip.background = element_rect(fill = "white"),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) 
}
# draw pairs plot
ggpairs(human,upper = list(continuous = wrap(cor_func, method = 'spearman', symbol = expression('\u03C1 ='))))

Figure 1. Overview of the variables in the human dataset

As we can see from Figure 1, the variables have varying distributions among observations. For example, Gross National Income per capita (GNIpc) is mostly concentrated near 0, with six outlier(ish) observations quite separate from the rest (>500). Maternal mortality ratio (matMort) and life expectancy at birth (lifeExp) have the strongest negative correlation, while maternal mortality ratio and adolescent birth rate (birthRate) have the strongest positive correlation among the variable pairs.

Principal component analysis

Performing PCA on unstandardized and standardized data

Next we will perform principal component analysis (PCA) on both the unstandardized and the standardized dataset to see how standardizing affects the interpretation of PCA results. It should be noted that it is generally recommended to normalize data before PCA.
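Scaling beforehand and letting prcomp() scale internally via its scale. argument give the same components. A quick sketch on random toy data (the matrix m is made up, standing in for the human data):

```r
set.seed(1)
# random toy matrix standing in for the human data
m <- matrix(rnorm(60), ncol = 4)

# PCA with internal scaling vs. PCA on a pre-scaled matrix
p1 <- prcomp(m, scale. = TRUE)
p2 <- prcomp(scale(m))

all.equal(p1$sdev, p2$sdev) # standard deviations of the PCs match
```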

We’ll do both of the plots in a single code block for convenience and add titles to the plots alongside the requested captioning with fig.cap in markdown.

## PCA without scaling
pca.results <- prcomp(human, scale. = F)

## plot with ggfortifys autoplot (more info here: https://cran.r-project.org/web/packages/ggfortify/vignettes/plot_pca.html)
p1 <- autoplot(pca.results,
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3) +
  theme_bw(14) +
  ggtitle("Unscaled PCA")

## PCA with scaling
pca.results.sc <- prcomp(human, scale. = T)

## plot with ggfortifys autoplot
p2 <- autoplot(pca.results.sc,
         loadings = TRUE, loadings.colour = 'blue',
         loadings.label = TRUE, loadings.label.size = 3) +
  theme_bw(14) +
  ggtitle("Scaled PCA")

## combine plots
ggarrange(p1, p2, align = "hv", labels = c("A", "B"))

Figure 2. A) PCA on unstandardized data. Initial PCA analysis with unstandardized data hints that maternal mortality (matMort) could be one of the key segregating factors between the countries. B) PCA on standardized data. With standardization we can more reliably see that there are several variables, such as life expectancy (lifeExp) and adolescent birth rate (birthRate), that explain the divergence between countries.

Interpreting the outcomes of PCA

When we compare the unscaled (Figure 2A) and scaled data (Figure 2B) we can straight away see the importance of standardizing/scaling/normalization. With unscaled data, Gross National Income per capita (GNIpc) and maternal mortality ratio (matMort), which both had quite heavily right-skewed distributions (Figure 1), are the major drivers of explained variance in the dataset. This is due to the differences in the scales of the values. For example, with the variable labF (proportion of females in the labor force) a unit change from 0.1 to 0.9 could clearly be viewed as a significant change, but with GNIpc the same numeric change, e.g. 300.1 to 300.9, is obviously of less impact, and vice versa. When we scale the data we make the variables more comparable and can thus observe the effects better. In Figure 2B we can see that when the scale of the values is accounted for, several more variables explain the observed variance. Life expectancy (lifeExp), maternal mortality ratio (matMort) and adolescent birth rate (birthRate) seem to be strongly connected with PC1, which explains a lot more variance than PC2 (49.85% vs 16.43% respectively).
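The variance-explained percentages shown on the PCA plot axes come from the squared standard deviations of the components. As a sketch (again on random toy data, not the human dataset):

```r
set.seed(42)
# random toy matrix, stand-in for the human data
m <- matrix(rnorm(100), ncol = 5)

pca <- prcomp(m, scale. = TRUE)

# proportion of variance explained by each PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(100 * var_explained, 2) # percentages, summing to 100
```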

Multiple Correspondence Analysis on the tea data

Selecting variables for and performing MCA

For the Multiple Correspondence Analysis (MCA) step we will use the tea dataset from the FactoMineR R package. Some information about the data from here:

The data used here concern a questionnaire on tea. We asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions).

Let's do some summarization of the data using some basic commands.

# load tea
data("tea")
# show structure 
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# show summary
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming          exciting  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255   exciting   :116  
##  Not.feminine:171   sophisticated    :215   slimming   : 45   No.exciting:184  
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##                                                                                
##         relaxing              effect.on.health
##  No.relaxing:113   effect on health   : 66    
##  relaxing   :187   No.effect on health:234    
##                                               
##                                               
##                                               
##                                               
## 

The dataset has 300 observations of 36 variables. 35 of these are categorical variables stored as factors, and one (age) is an integer variable.

When performing MCA it is good practice to leave out variables that have levels with next to no observations. Let's plot some bar plots and select 10 variables to use in the analysis.
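As a numeric complement to the bar plots, one can tabulate the size of the rarest level of each factor. This is a small sketch, not part of the original analysis; it assumes the same `tea` data, which also ships with the FactoMineR package.

```r
## Load the tea data shipped with FactoMineR (the same data as used above)
data(tea, package = "FactoMineR")

## For every factor variable, count the observations in its rarest level;
## variables at the top of this list have sparsely populated categories
## and are candidates to leave out of the MCA.
rare <- sapply(tea[sapply(tea, is.factor)], function(x) min(table(x)))
head(sort(rare))
```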

# Draw summary in the form of bar plots
gather(tea) %>% ggplot(aes(value)) + 
  geom_bar() + 
  facet_wrap("key", scales = "free") + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8) ,axis.title = element_blank()) ## tidy up
## Warning: attributes are not identical across measure variables;
## they will be dropped
**Figure 3.** Bar plots of the counts of the different factor levels across the variables in `tea`.

Let's select some variables whose distributions in Figure 3 are not heavily skewed: age_Q, breakfast, diuretic, escape.exoticism, exciting, feminine, frequency, SPC, tea.time and sugar.

We’ll perform MCA and then plot the results using the handy ggplot wrappers from the factoextra R package. The plots we are going to use are a scree plot (an overview of the inertia retained by each dimension) and a biplot with squared cosine (cos2) coloring of the variables. cos2 measures the degree of association between a variable category and a particular axis.

## selecting 10 variables that have reasonable counts across the categories
vars2mca <- c("age_Q", "breakfast", "diuretic", "escape.exoticism", "exciting", "feminine", "frequency", "SPC", "tea.time", "sugar")

## perform MCA on the selected variables
res.mca <- MCA(tea[,vars2mca], graph = F)

## using factoextra visualization functions generate some plot from MCA:
## first a visualization of the percentages of variance explained by the different dimensions
p1 <- fviz_screeplot(res.mca, addlabels = TRUE, ylim = c(0, 15))
## Secondly, a biplot with individuals (without labels) and variables.
p2 <- fviz_mca_biplot(res.mca, 
                      label = "var",
                      repel = TRUE,
                      ggtheme = theme_minimal(),
                      col.var = "cos2",
                      gradient.cols = c("blue", "red"))
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## combine the plots with labeling
ggarrange(p1, p2, labels = c("A", "B"), widths=c(1,1.8), ncol = 2)
**Figure 4. A)** Percentages of inertia explained by the MCA dimensions. **B)** MCA biplot of individuals and variable categories.
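As a possible follow-up (not part of the original figures), factoextra also offers `fviz_contrib()`, which shows how much each variable category contributes to a given dimension; the sketch below assumes the `res.mca` object and the factoextra library loaded above.

```r
## Bar plot of the top 15 category contributions to the first MCA
## dimension; the dashed red reference line marks the contribution
## expected if all categories contributed uniformly.
fviz_contrib(res.mca, choice = "var", axes = 1, top = 15)
```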

Interpreting the outcomes of MCA

The selected variables explain only a small proportion of the variance among the observations in any single dimension. From the scree plot (Figure 4A) we can see that even the leading dimensions each explain only roughly 10% of the observed variation at most. From the biplot (Figure 4B) we can see that the student category of SPC and the +60 category of age_Q both have a high quality of representation (cos2) on Dim1 and Dim2.
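The visual reading above can also be checked numerically. As a sketch (assuming the `res.mca` object from above), the exact inertia percentages and the cos2 values are available directly from the MCA result:

```r
## Eigenvalues and percentages of explained inertia for the first
## three dimensions (factoextra helper)
get_eigenvalue(res.mca)[1:3, ]

## cos2 (quality of representation) of each variable category on
## Dim1 and Dim2, stored in the FactoMineR result object
head(res.mca$var$cos2[, 1:2])
```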

date()
## [1] "Sun Dec 05 17:40:11 2021"